Observational study;
Causal Inference;
Matching;
R
Welders are exposed to chromium and nickel, substances that can cause inappropriate links between DNA and proteins, which in turn may disrupt gene expression or interfere with replication of DNA. Costa, Zhitkovich, and Toniolo [10] measured DNA-protein cross-links in samples of white blood cells from 47 subjects (all male). 01
Unmatched data for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no). Response is DPC=DNA-protein cross-links in percent in white blood cells. All 47 subjects are male.
Welders — Controls
| ID | Age | Race | Smoker | DPC | ID | Age | Race | Smoker | DPC | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 38 | C | N | 1.77 | 1 | 48 | AA | N | 1.08 | |
| 2 | 44 | C | N | 1.02 | 2 | 63 | C | N | 1.09 | |
| 3 | 39 | C | Y | 1.44 | 3 | 44 | C | Y | 1.1 | |
| 4 | 33 | AA | Y | 0.65 | 4 | 40 | C | N | 1.1 | |
| 5 | 35 | C | Y | 2.08 | 5 | 50 | C | N | 0.93 | |
| 6 | 39 | C | Y | 0.61 | 6 | 52 | C | N | 1.11 | |
| 7 | 27 | C | N | 2.86 | 7 | 56 | C | N | 0.98 | |
| 8 | 43 | C | Y | 4.19 | 8 | 47 | C | N | 2.2 | |
| 9 | 39 | C | Y | 4.88 | 9 | 38 | C | N | 0.88 | |
| 10 | 43 | AA | N | 1.08 | 10 | 34 | C | N | 1.55 | |
| 11 | 41 | C | Y | 2.03 | 11 | 42 | C | N | 0.55 | |
| 12 | 36 | C | N | 2.81 | 12 | 36 | C | Y | 1.04 | |
| 13 | 35 | C | N | 0.94 | 13 | 41 | C | N | 1.66 | |
| 14 | 37 | C | N | 1.43 | 14 | 41 | AA | Y | 1.49 | |
| 15 | 39 | C | Y | 1.25 | 15 | 31 | AA | Y | 1.36 | |
| 16 | 34 | C | N | 2.97 | 16 | 56 | AA | Y | 1.02 | |
| 17 | 35 | C | Y | 1.01 | 17 | 51 | AA | N | 0.99 | |
| 18 | 53 | C | N | 2.07 | 18 | 36 | C | Y | 0.65 | |
| 19 | 38 | C | Y | 1.15 | 19 | 44 | C | N | 0.42 | |
| 20 | 37 | C | N | 1.07 | 20 | 35 | C | N | 2.33 | |
| 21 | 38 | C | Y | 1.63 | 21 | 34 | C | Y | 0.97 | |
| 22 | 39 | C | Y | 0.62 | ||||||
| 23 | 45 | C | N | 1.02 | ||||||
| 24 | 42 | C | N | 1.78 | ||||||
| 25 | 30 | C | N | 0.95 | ||||||
| 26 | 35 | C | Y | 1.59 |
# loadind data
gen.tox <- read_excel("data/table81.xlsx", sheet = "data")
data.tab <- gen.tox[c(1:3, 22:24),]
kable(data.tab, caption = "Generic Toxicology Dataset (6-observation example)") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| ID | Group | Age | Race | Smoker | DPC |
|---|---|---|---|---|---|
| 1 | Welders | 38 | C | N | 1.77 |
| 2 | Welders | 44 | C | N | 1.02 |
| 3 | Welders | 39 | C | Y | 1.44 |
| 22 | Controls | 48 | AA | N | 1.08 |
| 23 | Controls | 63 | C | N | 1.09 |
| 24 | Controls | 44 | C | Y | 1.10 |
m.age.w <- round(mean(gen.tox$Age[gen.tox$Group=="Welders"]),0)
m.age.c <- round(mean(gen.tox$Age[gen.tox$Group=="Controls"]),0)
aa.p.w <- round(table(gen.tox$Race[gen.tox$Group=="Welders"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Welders"]))*100,0)
aa.p.c <- round(table(gen.tox$Race[gen.tox$Group=="Controls"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Controls"]))*100,0)
s.p.w <- round(table(gen.tox$Smoker[gen.tox$Group=="Welders"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Welders"]))*100,0)
s.p.c <- round(table(gen.tox$Smoker[gen.tox$Group=="Controls"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Controls"]))*100,0)Welders — Controls
| Mean Age | AA | Smoker | — | Mean Age | AA | Smoker |
|---|---|---|---|---|---|---|
| 38 | 10 | 52 | — | 43 | 19 | 35 |
t.test(Age ~ Group, data=gen.tox)
Welch Two Sample t-test
data: Age by Group
t = 2.2537, df = 42.167, p-value = 0.02947
alternative hypothesis: true difference in means between group Controls and group Welders is not equal to 0
95 percent confidence interval:
0.4662145 8.4422104
sample estimates:
mean in group Controls mean in group Welders
42.69231 38.23810
tbl.age.group <- table(gen.tox$Group, gen.tox$Race)
tbl.age.group
AA C
Controls 5 21
Welders 2 19
chisq.test(tbl.age.group)
Pearson's Chi-squared test with Yates' continuity correction
data: tbl.age.group
X-squared = 0.26754, df = 1, p-value = 0.605
fisher.test(tbl.age.group, conf.int = T, conf.level = 0.95)
Fisher's Exact Test for Count Data
data: tbl.age.group
p-value = 0.4364
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3170452 25.9844336
sample estimates:
odds ratio
2.22477
The welders are about 5 years younger than the controls on average, have relatively fewer African Americans, and more smokers.
Given the t-test (t = 2.25) and a two-sided significance level of 0.03 showed the difference in age between welders and controls is
The 2 binary variables race & smoking are neither significant by that standard nor by Fisher’s exact test for a 2×2 table.
Data genetic toxicity had \(L\) = 47 subjects, \(l = 1,2, ..., L\).
For subject \(l\), the observed covariate, \(\textbf{x}_l\), is 3-dimensional, \(\textbf{x}_l = (x_{l1}, x_{l2}, x_{l3})^T\) , where
\(x_{l1}\) is the age
\(x_{l2}\) encodes race, \(x_{l2}\) = 1 if \(l\) is African American, \(x_{l2}\) =0 if \(l\) is Caucasian,
\(x_{l3}\) encodes smoking, \(x_{l3}\) = 1 if \(l\) is is a current smoker, \(x_{l3}\) = 0 otherwise.
\(\rightarrow\) For instance, \(x_1\) = \((38,0,0)^T\) .
gen.tox$Race <- ifelse(gen.tox$Race=="C", 0, 1)
gen.tox$Smoker <- ifelse(gen.tox$Smoker=="N", 0, 1)PS is the conditional probability of exposure to
treatment given the observed covariates, \(e
(\textbf{X}) = Pr(\textbf{Z} = 1| \textbf{x})\).PS is defined in terms of the observed covariates,
x, even though there is invariably concern about other covariates that
were not measured.PS:
Brief examination of Genetic Toxicity data suggested:
at least age, \(x_1\), and
possibly race and smoking, \(x_2\) and
\(x_3\), could be used to predict
treatment assignment \(Z\), so that the
PS is not constant.
Saying that welders tend to be somewhat younger than controls is much the same as saying that the chance of being a welder is lower for someone who is older, i.e. \(e(\textbf{X})\) is lower when \(x_1\) is higher.
If 2 subjects have the same PS \(e(\textbf{X})\), they may have different
values of \(X\) \(\rightarrow\) subjects were matched for
\(e(\textbf{X})\), they may be
mismatched for x \(\rightarrow\) the
mismatches in x will be due to chance and will tend to balance,
particularly in large samples. ( e.g. if young
nonsmokers and old smokers have the same PS, then a match
on the PS may pair a young nonsmoking welder to an old
smoking control, but it will do this about as often as it pairs an old
smoking welder to a young nonsmoking control) \(\rightarrow\) matching on \(e(\textbf(X))\) tends to balance \(\textbf{X}\) \(\rightarrow\) \(Z\)| \(e(\textbf{X})\) indep. \(X\)
In short,
matching on \(e(\textbf{X})\) is often practical even when there are many covariates in x because \(e(\textbf{X})\) is a single variable,
failure to balance \(e(\textbf{X})\) implies that \(X\) is not balanced
matching on \(e(\textbf{X})\) tends to balance all of \(X\). On the negative side, success in balancing the observed covariates, \(X\), provides no assurance that unmeasured covariates are balanced
\(\rightarrow\) For example, the men whose his father worked as a welder is associated with a much higher chance that the son also will work as a welder, so father’s occupation is an unmeasured covariate that could be used to predict treatment \(Z\); then success in matching on the propensity score \(e(\textbf{X})\) would tend to balance age, race and smoking, but there is no reason to expect it to balance father’s occupation.
PSThe PS is unknown, but it can be estimated from the data
at hand. PS is estimated by a linear logit model \[ log {\frac{e(\textbf{X}_l)}{1 -
e(\textbf{X}_l)}} = \zeta_0 + \zeta_1 x_{l1} + \zeta_2 x_{l2} + \zeta_3
x_{l3}\]
and the fitted values \(\hat{e}(\textbf{X}_l)\) from this model are
the estimates of the PS.
#fit a propensity score model. logistic regression
gen.tox$Group <- relevel(as.factor(gen.tox$Group), ref="Controls")
psmodel <- glm(Group ~ Age + Race + Smoker, family=binomial(), data=gen.tox)
#show coefficients etc
summary(psmodel)
Call:
glm(formula = Group ~ Age + Race + Smoker, family = binomial(),
data = gen.tox)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4968 -1.0216 -0.5245 1.0786 1.8186
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.06543 2.14200 1.431 0.152
Age -0.08503 0.05178 -1.642 0.101
Race -0.76900 0.98252 -0.783 0.434
Smoker 0.55095 0.65212 0.845 0.398
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 64.623 on 46 degrees of freedom
Residual deviance: 58.706 on 43 degrees of freedom
AIC: 66.706
Number of Fisher Scoring iterations: 3
#create propensity score
pscore<-psmodel$fitted.values
gen.tox$ps <- round(pscore,2)
# mean of PS in Welders group
mean.ps.welder <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)
# mean of PS in 21 largest Controls group
mean.ps.21largestcontrols <- gen.tox %>%
filter(Group=="Controls") %>%
top_n(21, ps) %>% # wrapper that uses filter() and min_rank() to select the top or bottom entries in each group, ordered by ps.
arrange(desc(ps)) %>% # order data follow ps decreasing
summarize(round(mean(ps),2)) %>%
as.numeric()Estimated PS for 21 railroad arc welders and 26
potential controls. Covariates are age, race (C=Caucasian, AA=African
American), current smoker (Y=yes, N=no).
| ID | Age | Race | Smoker | DPC | \(\hat{e}(\textbf{X})\) | ID | Age | Race | Smoker | DPC | \(\hat{e}(\textbf{X})\) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 38 | C | N | 1.77 | 0.46 | 1 | 48 | AA | N | 1.08 | 0.14 |
| 2 | 44 | C | N | 1.02 | 0.34 | 2 | 63 | C | N | 1.09 | 0.09 |
| 3 | 39 | C | Y | 1.44 | 0.57 | 3 | 44 | C | Y | 1.1 | 0.47 |
| 4 | 33 | AA | Y | 0.65 | 0.51 | 4 | 40 | C | N | 1.1 | 0.42 |
| 5 | 35 | C | Y | 2.08 | 0.65 | 5 | 50 | C | N | 0.93 | 0.23 |
| 6 | 39 | C | Y | 0.61 | 0.57 | 6 | 52 | C | N | 1.11 | 0.2 |
| 7 | 27 | C | N | 2.86 | 0.68 | 7 | 56 | C | N | 0.98 | 0.15 |
| 8 | 43 | C | Y | 4.19 | 0.49 | 8 | 47 | C | N | 2.2 | 0.28 |
| 9 | 39 | C | Y | 4.88 | 0.57 | 9 | 38 | C | N | 0.88 | 0.46 |
| 10 | 43 | AA | N | 1.08 | 0.2 | 10 | 34 | C | N | 1.55 | 0.46 |
| 11 | 41 | C | Y | 2.03 | 0.53 | 11 | 42 | C | N | 0.55 | 0.54 |
| 12 | 36 | C | N | 2.81 | 0.5 | 12 | 36 | C | Y | 1.04 | 0.38 |
| 13 | 35 | C | N | 0.94 | 0.52 | 13 | 41 | C | N | 1.66 | 0.64 |
| 14 | 37 | C | N | 1.43 | 0.48 | 14 | 41 | AA | Y | 1.49 | 0.4 |
| 15 | 39 | C | Y | 1.25 | 0.57 | 15 | 31 | AA | Y | 1.36 | 0.35 |
| 16 | 34 | C | N | 2.97 | 0.54 | 16 | 56 | AA | Y | 1.02 | 0.55 |
| 17 | 35 | C | Y | 1.01 | 0.65 | 17 | 51 | AA | N | 0.99 | 0.13 |
| 18 | 53 | C | N | 2.07 | 0.19 | 18 | 36 | C | Y | 0.65 | 0.12 |
| 19 | 38 | C | Y | 1.15 | 0.6 | 19 | 44 | C | N | 0.42 | 0.64 |
| 20 | 37 | C | N | 1.07 | 0.48 | 20 | 35 | C | N | 2.33 | 0.34 |
| 21 | 38 | C | Y | 1.63 | 0.6 | 21 | 34 | C | Y | 0.97 | 0.52 |
| 22 | 39 | C | Y | 0.62 | |||||||
| 23 | 45 | C | N | 1.02 | |||||||
| 24 | 42 | C | N | 1.78 | |||||||
| 25 | 30 | C | N | 0.95 | |||||||
| 26 | 35 | C | Y | 1.59 |
ps.w <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)
ps.c <- round(mean(gen.tox$ps[gen.tox$Group=="Controls"]),2)Welders — Controls
| Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) |
|---|---|---|---|---|---|---|---|---|
| 38 | 10 | 52 | 0.51 | — | 43 | 19 | 35 | 0.4 |
Welders #10 and #18 have similar \(\hat{e}(\textbf{x})\) but different patterns of covariates \(X\).
The limitations of pair matching, apply with any variable,
including the PS. The mean of the 21 largest \(\hat{e}(\textbf{x})\)’s in the control
group is 0.46, somewhat less than the mean of \(\hat{e}(\textbf{x})\)) in the treated
group, namely 0.51, so no pair matching can completely close that
gap.
A distance matrix is a table with one row for each
treated subject and one column for each potential control.
The value in row \(i\) and column \(j\) of the table is the
distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control. 2 individuals
with the same value \(\mathbf{x}\)
would have distance zero.
The welder data: 21 rows x 26 columns (21x26) distance. The distance is the squared difference in the \(\hat{e}(\textbf{x})\)
ps.vec.c <- gen.tox$ps[gen.tox$Group=="Controls"]
ps.vec.w <- gen.tox$ps[gen.tox$Group=="Welders"]
d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
for (j in 1:26){
d.m[i,j] <- round((ps.vec.w[i] - ps.vec.c[j])^2, 2)
}
}Table. Squared differences in PS
between welders and controls: \((\hat{e}(\textbf{X})_{Welder} -
\hat{e}(\textbf{X})_{Control})^2\)
Rows are the 21 welders and columns are for the first 6 of 26 potential controls.
Welder <- c(1:21)
d.m.d <- as.data.frame(d.m)
names(d.m.d) <- paste0("Control ", 1:26)
d.m.d.n <- as.data.frame(cbind(Welder, d.m.d))
kable(d.m.d.n[,1:7], caption = "Squared differences in `PS` between welders and controls") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
|---|---|---|---|---|---|---|
| 1 | 0.10 | 0.14 | 0.00 | 0.00 | 0.05 | 0.07 |
| 2 | 0.04 | 0.06 | 0.02 | 0.01 | 0.01 | 0.02 |
| 3 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
| 4 | 0.14 | 0.18 | 0.00 | 0.01 | 0.08 | 0.10 |
| 5 | 0.26 | 0.31 | 0.03 | 0.05 | 0.18 | 0.20 |
| 6 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
| 7 | 0.29 | 0.35 | 0.04 | 0.07 | 0.20 | 0.23 |
| 8 | 0.12 | 0.16 | 0.00 | 0.00 | 0.07 | 0.08 |
| 9 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
| 10 | 0.00 | 0.01 | 0.07 | 0.05 | 0.00 | 0.00 |
| 11 | 0.15 | 0.19 | 0.00 | 0.01 | 0.09 | 0.11 |
| 12 | 0.13 | 0.17 | 0.00 | 0.01 | 0.07 | 0.09 |
| 13 | 0.14 | 0.18 | 0.00 | 0.01 | 0.08 | 0.10 |
| 14 | 0.12 | 0.15 | 0.00 | 0.00 | 0.06 | 0.08 |
| 15 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
| 16 | 0.16 | 0.20 | 0.00 | 0.01 | 0.10 | 0.12 |
| 17 | 0.26 | 0.31 | 0.03 | 0.05 | 0.18 | 0.20 |
| 18 | 0.00 | 0.01 | 0.08 | 0.05 | 0.00 | 0.00 |
| 19 | 0.21 | 0.26 | 0.02 | 0.03 | 0.14 | 0.16 |
| 20 | 0.12 | 0.15 | 0.00 | 0.00 | 0.06 | 0.08 |
| 21 | 0.21 | 0.26 | 0.02 | 0.03 | 0.14 | 0.16 |
Look at the disadvantage:
- that 2 controls with the same \(\hat{e}(\mathbf{x})\) may have different
patterns of covariates, \(\mathbf{x}\),
and this is ignored
\(\Rightarrow\) In the \(1^{st}\) row and \(3^{rd}\) and \(4^{th}\) columns of Table above, the
distance is 0 to two decimal places. Actually, the distances between
welder #1 and potential controls #3 and #4 are, respectively, \((0.46−0.47)^2 = 0.0001\) and \((0.46−0.42)^2 = 0.0016,\) so control #3 is
ever so slightly closer to welder #1.
- However, in terms of the details of \(\textbf{x}\), control #4
looks to be the better match, a nonsmoker with a two-year
difference in age, as opposed to control #3, a
smoker with a six-year difference in age.
\(\rightarrow\) Because younger smokers
are more common in the welder group, the PS is indifferent
between a younger nonsmoker and an older smoker, but the details of
\(\textbf{x}\) suggest that
control #4 is a better match for welder #1 than is
control #3.
PSAn alternative distance (Paul R. Rosenbaum and Rubin 1985):
PS, \(\hat{e}(\textbf{x})\), but once this is
achieved, the details of \(\textbf{x}\)
affect the distance.caliper of width \(w\), if two individuals, say \(k\) and \(l\),
PS that differ by more than \(w\) i.e. \(|
\hat{e}(\textbf{x}_k) - \hat{e}(\textbf{x}_l) | > w\) \(\rightarrow\) set \(w\) = \(\infty\)sd.ps <- round(sd(pscore),3) # [1] 0.172
w <- round(sd.ps/2,3)
paste0("The standard deviation of ps: ", sd.ps)[1] "The standard deviation of ps: 0.172"
paste0("The caliper w (half of sd of ps): ", sd.ps, ", used to demonstate in this post.")[1] "The caliper w (half of sd of ps): 0.172, used to demonstate in this post."
PS, so
that by varying the multiplier, one can vary the relative importance
given to \(\hat{e}(\textbf{x})\) and
\(\textbf{x}\).
PS, in which the caliper is half of the standard deviation
of the PS, or 0.172/2 = 0.086.In problems of practical size, a
caliperof 20% of the sd ofpsis more common, and even that may be too large. A reasonable strategy: to start with acaliperof 20% of the sd ofps\(\rightarrow\) adjusting the caliper if needed to obtain balance on the propensity score.
If \(\hat{\Sigma}\) is the sample covariance matrix of \(\textbf{x}\), then the estimated Mahalanobis distance between \(\textbf{x}_k\) and \(\textbf{x}_l\) is \[ (\textbf{x}_k - \textbf{x}_l)^T \hat{\Sigma}^{-1} (\textbf{x}_k - \textbf{x}_l) \]
var.m <-gen.tox %>%
dplyr::select(Age, Race, Smoker) %>%
unname() %>%
as.matrix()# sample variance covariance matrix
cov.m <- cov(var.m)
round(cov.m,2) [,1] [,2] [,3]
[1,] 54.04 0.39 -0.94
[2,] 0.39 0.13 0.02
[3,] -0.94 0.02 0.25
# inverse of sample variance covariance matrix
cov.m.inv <- inv(cov.m)
round(cov.m.inv,3) [,1] [,2] [,3]
[1,] 0.021 -0.077 0.084
[2,] -0.077 8.127 -1.009
[3,] 0.084 -1.009 4.407
dmat
Rubin (1980)Distance matrix would have 21 treated subject (Welders) and 26
controls; the value in row \(i\) and
column \(j\) of the table is the
distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control.
var.m.w <-gen.tox %>%
filter(Group=="Welders") %>%
dplyr::select(Age, Race, Smoker) %>%
as.matrix()
var.m.c <-gen.tox %>%
filter(Group=="Controls") %>%
dplyr::select(Age, Race, Smoker) %>%
as.matrix()
m.d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
for (j in 1:26){
m.d.m[i,j] <- t(var.m.w[i,] - var.m.c[j,]) %*% cov.m.inv %*% (var.m.w[i,] - var.m.c[j,])
}
}
# Mahalanobis distance
Welder <- c(1:21)
d.m.d.m <- as.data.frame(m.d.m)
names(d.m.d.m) <- paste0("Control ", 1:26)
d.d.m.d.n <- cbind(Welder, d.m.d.m)
kable(round(d.d.m.d.n[,1:7],2), caption = "Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
|---|---|---|---|---|---|---|
| 1 | 8.65 | 12.82 | 6.15 | 0.08 | 2.95 | 4.02 |
| 2 | 7.84 | 7.40 | 4.41 | 0.33 | 0.74 | 1.31 |
| 3 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
| 4 | 6.51 | 28.55 | 12.29 | 11.42 | 16.20 | 17.65 |
| 5 | 13.85 | 15.80 | 1.66 | 4.08 | 6.51 | 7.49 |
| 6 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
| 7 | 13.95 | 26.58 | 13.18 | 3.47 | 10.85 | 12.82 |
| 8 | 13.46 | 9.27 | 0.02 | 5.09 | 4.24 | 4.56 |
| 9 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
| 10 | 0.51 | 19.40 | 14.89 | 7.85 | 10.20 | 11.17 |
| 11 | 13.31 | 10.65 | 0.18 | 4.59 | 4.56 | 5.05 |
| 12 | 9.24 | 14.95 | 7.06 | 0.33 | 4.02 | 5.25 |
| 13 | 9.60 | 16.08 | 7.57 | 0.51 | 4.61 | 5.93 |
| 14 | 8.92 | 13.87 | 6.58 | 0.18 | 3.47 | 4.61 |
| 15 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
| 16 | 10.00 | 17.25 | 8.13 | 0.74 | 5.25 | 6.65 |
| 17 | 13.85 | 15.80 | 1.66 | 4.08 | 6.51 | 7.49 |
| 18 | 9.41 | 2.05 | 4.56 | 3.47 | 0.18 | 0.02 |
| 19 | 13.40 | 13.04 | 0.74 | 4.15 | 5.35 | 6.08 |
| 20 | 8.92 | 13.87 | 6.58 | 0.18 | 3.47 | 4.61 |
| 21 | 13.40 | 13.04 | 0.74 | 4.15 | 5.35 | 6.08 |
Another way to produce the Mahalanobis distance matrix: look at the
source code of mahal function.
source("mahal.R")
print(mahal)function (z, X)
{
X <- as.matrix(X)
n <- dim(X)[1]
rownames(X) <- 1:n
k <- dim(X)[2]
m <- sum(z)
cv <- cov(X)
out <- matrix(NA, m, n - m)
Xt <- X[z == 1, ]
Xc <- X[z == 0, ]
rownames(out) <- rownames(X)[z == 1]
colnames(out) <- rownames(X)[z == 0]
require(MASS)
icov <- ginv(cv)
for (i in 1:m) {
out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
}
out
}
PS calipers.Rows are the 21 welders and columns are for the first 6 of 26 potential controls. An \(\infty\) signifies that the caliper is violated.
m.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
for (j in 1:26){
m.m[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m[i,j])
}
}
Welder <- c(1:21)
d.m.m <- as.data.frame(m.m)
names(d.m.m) <- paste0("Control ", 1:26)
d.d.m.m <- cbind(Welder, d.m.m)
kable(round(d.d.m.m[,1:7],2), caption = "Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
|---|---|---|---|---|---|---|
| 1 | Inf | Inf | 6.15 | 0.08 | Inf | Inf |
| 2 | Inf | Inf | Inf | 0.33 | Inf | Inf |
| 3 | Inf | Inf | Inf | Inf | Inf | Inf |
| 4 | Inf | Inf | 12.29 | Inf | Inf | Inf |
| 5 | Inf | Inf | Inf | Inf | Inf | Inf |
| 6 | Inf | Inf | Inf | Inf | Inf | Inf |
| 7 | Inf | Inf | Inf | Inf | Inf | Inf |
| 8 | Inf | Inf | 0.02 | 5.09 | Inf | Inf |
| 9 | Inf | Inf | Inf | Inf | Inf | Inf |
| 10 | 0.51 | Inf | Inf | Inf | 10.20 | 11.17 |
| 11 | Inf | Inf | 0.18 | Inf | Inf | Inf |
| 12 | Inf | Inf | 7.06 | 0.33 | Inf | Inf |
| 13 | Inf | Inf | 7.57 | Inf | Inf | Inf |
| 14 | Inf | Inf | 6.58 | 0.18 | Inf | Inf |
| 15 | Inf | Inf | Inf | Inf | Inf | Inf |
| 16 | Inf | Inf | 8.13 | Inf | Inf | Inf |
| 17 | Inf | Inf | Inf | Inf | Inf | Inf |
| 18 | 9.41 | Inf | Inf | Inf | 0.18 | 0.02 |
| 19 | Inf | Inf | Inf | Inf | Inf | Inf |
| 20 | Inf | Inf | 6.58 | 0.18 | Inf | Inf |
| 21 | Inf | Inf | Inf | Inf | Inf | Inf |
Mahalanobis distance was originally developed for
use with multivariate Normal data
\(\rightarrow\) With data that are not
Normal, Mahalanobis distance can exhibit some rather odd
behavior.
If one covariate
\(\rightarrow\) its standard deviation will be inflated, and \(\rightarrow\) Mahalanobis distance will tend to ignore that covariate in matching.
\(\rightarrow\)
Mahalanobis distance gives greater weight to binary
variables with probabilities near 0 or 1 than to binary variables with
probabilities closer to 1/2.
welder data,
Age and Race
but different smoking behavior have a Mahalanobis distance
of 4.41,Age and
Smoking behavior but different Race have a
Mahalanobis distance of 8.06,Race is counted as almost twice as bad as a mismatch on
Smoking.Age with the same
Race and Smoking would have a
Mahalanobis distance of 0.021 × \(20^2\) = 8.204232,Race counts about as much as a 20-year difference in
Age.\(\Longrightarrow\) In many contexts, rare binary covariates are not of overriding importance, and outliers do not make a covariate unimportant \(\rightarrow\)
Mahalanobis distancemay not be appropriate with covariates of this kind.
rank-based Mahalanobis distance### Note that the covariance matrix now is on the rank-based matrix
data.r <- gen.tox %>%
dplyr::select(Age, Race, Smoker) %>%
as.matrix()
for (j in 1:dim(data.r)[2]){data.r[,j] <- rank(data.r[,j])}
cov.m.r <- cov(data.r)
## Step 1
var.m.w.r <- data.r[gen.tox$Group=="Welders",]
var.m.c.r <- data.r[gen.tox$Group=="Controls",]
## Step 2:
var.untied <- var(1:dim(gen.tox)[1])
rat <- sqrt(var.untied/diag(cov.m.r))
cov.m.r.adjusted <- diag(rat) %*% cov.m.r %*% diag(rat)
cov.m.r.adjusted.inv <- inv(cov.m.r.adjusted)
m.d.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
for (j in 1:26){
m.d.m.r[i,j] <- t(var.m.w.r[i,] - var.m.c.r[j,]) %*% cov.m.r.adjusted.inv %*% (var.m.w.r[i,] - var.m.c.r[j,])
}
}
# rank-based Mahalanobis distance
Welder <- c(1:21)
d.m.d.m.r <- as.data.frame(m.d.m.r)
names(d.m.d.m.r) <- paste0("Control ", 1:26)
d.d.m.d.m.r <- cbind(Welder, d.m.d.m.r)
kable(round(d.d.m.d.m.r[,1:7],2), caption = "Rank-based Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
|---|---|---|---|---|---|---|
| 1 | 4.61 | 4.40 | 5.98 | 0.33 | 2.69 | 3.21 |
| 2 | 2.98 | 0.70 | 3.21 | 0.47 | 0.15 | 0.28 |
| 3 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
| 4 | 8.14 | 14.80 | 10.43 | 7.69 | 12.17 | 13.00 |
| 5 | 9.03 | 8.49 | 3.93 | 3.66 | 6.55 | 7.15 |
| 6 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
| 7 | 10.20 | 12.31 | 12.86 | 3.93 | 9.30 | 10.26 |
| 8 | 6.77 | 3.29 | 0.04 | 3.92 | 2.99 | 3.05 |
| 9 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
| 10 | 0.25 | 4.72 | 7.61 | 3.03 | 3.72 | 4.01 |
| 11 | 6.71 | 3.79 | 0.28 | 3.38 | 3.18 | 3.34 |
| 12 | 5.86 | 6.33 | 7.61 | 0.98 | 4.24 | 4.89 |
| 13 | 6.98 | 7.96 | 9.02 | 1.68 | 5.59 | 6.33 |
| 14 | 5.25 | 5.41 | 6.83 | 0.64 | 3.49 | 4.08 |
| 15 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
| 16 | 8.30 | 9.78 | 10.61 | 2.56 | 7.12 | 7.96 |
| 17 | 9.03 | 8.49 | 3.93 | 3.66 | 6.55 | 7.15 |
| 18 | 3.33 | 0.05 | 3.00 | 1.68 | 0.05 | 0.01 |
| 19 | 7.34 | 5.62 | 1.58 | 2.99 | 4.34 | 4.72 |
| 20 | 5.25 | 5.41 | 6.83 | 0.64 | 3.49 | 4.08 |
| 21 | 7.34 | 5.62 | 1.58 | 2.99 | 4.34 | 4.72 |
Another way to produce the rank-based Mahalanobis distance matrix:
look at the source code of smahal function.
source("smahal.R")
print(smahal)function (z, X)
{
X <- as.matrix(X)
n <- dim(X)[1]
rownames(X) <- 1:n
k <- dim(X)[2]
m <- sum(z)
for (j in 1:k) {
X[, j] <- rank(X[, j])
}
cv <- cov(X)
vuntied <- var(1:n)
rat <- sqrt(vuntied/diag(cv))
cv <- diag(rat) %*% cv %*% diag(rat)
out <- matrix(NA, m, n - m)
Xc <- X[z == 0, ]
Xt <- X[z == 1, ]
rownames(out) <- rownames(X)[z == 1]
colnames(out) <- rownames(X)[z == 0]
library(MASS)
icov <- ginv(cv)
for (i in 1:m) {
out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
}
out
}
welder data,
Age
and Race but different smoking behavior have a
rank-based Mahalanobis distance of 3.21 (4.41 for a Mahalanobis
distance),Age and Smoking behavior but different
Race have a rank-based Mahalanobis distance of 3.07 (8.06
for a Mahalanobis distance),\(\Longrightarrow\) a mismatch on
raceis counted as about equal to a mismatch onsmoking.
Rank-based Mahalanobis distances within Propensity Score
calipers. An Inf signifies that the caliper is violated
m.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
for (j in 1:26){
m.m.r[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m.r[i,j])
}
}
Welder <- c(1:21)
d.m.m.r <- as.data.frame(m.m.r)
names(d.m.m.r) <- paste0("Control ", 1:26)
d.d.m.m.r <- cbind(Welder, d.m.m.r)
kable(round(d.d.m.m.r[,1:7],2), caption = "rank-based Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
|---|---|---|---|---|---|---|
| 1 | Inf | Inf | 5.98 | 0.33 | Inf | Inf |
| 2 | Inf | Inf | Inf | 0.47 | Inf | Inf |
| 3 | Inf | Inf | Inf | Inf | Inf | Inf |
| 4 | Inf | Inf | 10.43 | Inf | Inf | Inf |
| 5 | Inf | Inf | Inf | Inf | Inf | Inf |
| 6 | Inf | Inf | Inf | Inf | Inf | Inf |
| 7 | Inf | Inf | Inf | Inf | Inf | Inf |
| 8 | Inf | Inf | 0.04 | 3.92 | Inf | Inf |
| 9 | Inf | Inf | Inf | Inf | Inf | Inf |
| 10 | 0.25 | Inf | Inf | Inf | 3.72 | 4.01 |
| 11 | Inf | Inf | 0.28 | Inf | Inf | Inf |
| 12 | Inf | Inf | 7.61 | 0.98 | Inf | Inf |
| 13 | Inf | Inf | 9.02 | Inf | Inf | Inf |
| 14 | Inf | Inf | 6.83 | 0.64 | Inf | Inf |
| 15 | Inf | Inf | Inf | Inf | Inf | Inf |
| 16 | Inf | Inf | 10.61 | Inf | Inf | Inf |
| 17 | Inf | Inf | Inf | Inf | Inf | Inf |
| 18 | 3.33 | Inf | Inf | Inf | 0.05 | 0.01 |
| 19 | Inf | Inf | Inf | Inf | Inf | Inf |
| 20 | Inf | Inf | 6.83 | 0.64 | Inf | Inf |
| 21 | Inf | Inf | Inf | Inf | Inf | Inf |
optimal pair matching pairs each treated subject
with a different control to minimize the total distance within matched
pairs.
Hansen’s pairmatch function in his optmatch
package makes Bertsekas’ Fortran code (aka ‘assignment
problem’ solved by Kuhn in 1955) available from inside the
statistical package R;
The SAS program proc assign also solves the assignment
problem.
PSctrl <- NULL
k <- NULL
d <- d.m.d.n[-1]
for (i in 1:21){
k <- which.min(d[i,])
ctrl[i] <- names(k)
d <- d[-k]
}
index <- base::order(ctrl)
Welders <- gen.tox %>%
filter(Group=="Welders") %>%
dplyr::select(Age, Race, Smoker, ps)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps")]
Controls <- gen.tox %>%
filter(Group=="Controls") %>%
dplyr::select(Age, Race, Smoker, ps)
Controls$id <- paste0("Control ",1:26)
matched.controls <- Controls[which(Controls$id %in% ctrl),]
# Reorder follow the ctrl match order
matched.controls$id <- reorder.factor(matched.controls$id, new.order=ctrl)
#matched.controls <- matched.controls[-5]
matched.controls <- matched.controls %>% dplyr::arrange(id)
kable(cbind(Welders,matched.controls), caption="Pair match using the closed measure in the PS") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Pair | Age | Race | Smoker | ps | Age | Race | Smoker | ps | id |
|---|---|---|---|---|---|---|---|---|---|
| Pair 1 | 38 | 0 | 0 | 0.46 | 44 | 0 | 1 | 0.47 | Control 3 |
| Pair 2 | 44 | 0 | 0 | 0.34 | 47 | 0 | 0 | 0.28 | Control 8 |
| Pair 3 | 39 | 0 | 1 | 0.57 | 34 | 0 | 0 | 0.54 | Control 10 |
| Pair 4 | 33 | 1 | 1 | 0.51 | 38 | 0 | 0 | 0.46 | Control 9 |
| Pair 5 | 35 | 0 | 1 | 0.65 | 36 | 0 | 1 | 0.64 | Control 12 |
| Pair 6 | 39 | 0 | 1 | 0.57 | 31 | 1 | 1 | 0.55 | Control 15 |
| Pair 7 | 27 | 0 | 0 | 0.68 | 36 | 0 | 1 | 0.64 | Control 18 |
| Pair 8 | 43 | 0 | 1 | 0.49 | 40 | 0 | 0 | 0.42 | Control 4 |
| Pair 9 | 39 | 0 | 1 | 0.57 | 35 | 0 | 0 | 0.52 | Control 20 |
| Pair 10 | 43 | 1 | 0 | 0.20 | 48 | 1 | 0 | 0.14 | Control 1 |
| Pair 11 | 41 | 0 | 1 | 0.53 | 39 | 0 | 1 | 0.57 | Control 22 |
| Pair 12 | 36 | 0 | 0 | 0.50 | 42 | 0 | 0 | 0.38 | Control 11 |
| Pair 13 | 35 | 0 | 0 | 0.52 | 41 | 0 | 0 | 0.40 | Control 13 |
| Pair 14 | 37 | 0 | 0 | 0.48 | 42 | 0 | 0 | 0.38 | Control 24 |
| Pair 15 | 39 | 0 | 1 | 0.57 | 30 | 0 | 0 | 0.63 | Control 25 |
| Pair 16 | 34 | 0 | 0 | 0.54 | 35 | 0 | 1 | 0.65 | Control 26 |
| Pair 17 | 35 | 0 | 1 | 0.65 | 34 | 0 | 1 | 0.67 | Control 21 |
| Pair 18 | 53 | 0 | 0 | 0.19 | 50 | 0 | 0 | 0.23 | Control 5 |
| Pair 19 | 38 | 0 | 1 | 0.60 | 41 | 1 | 1 | 0.35 | Control 14 |
| Pair 20 | 37 | 0 | 0 | 0.48 | 44 | 0 | 0 | 0.34 | Control 19 |
| Pair 21 | 38 | 0 | 1 | 0.60 | 45 | 0 | 0 | 0.32 | Control 23 |
m.age.mps <- round(mean(matched.controls$Age),0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)
ps.mps <- round(mean(matched.controls$ps),2)Welders — Matched Controls
| Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) |
|---|---|---|---|---|---|---|---|---|
| 38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.46 |
\(\Longrightarrow\) the marginal means the matching above are well balanced as expected
Next, I use the pairmatch function in
optmatch package to compare.
gen.tox.op <- gen.tox
gen.tox.op$Group1 <- as.numeric(ifelse(gen.tox.op$Group=="Welders", 1,0))
#pairmatch(match_on( Group~ps, data=gen.tox ) )# did not run
pm <- pairmatch(Group1 ~ ps, data = gen.tox.op)
print(pairmatch(Group1 ~ ps, data = gen.tox.op), grouped = TRUE) # knowing the ordered id of treated and controls Group Members
1.1 1, 35
1.10 18, 27
1.11 19, 46
1.12 2, 29
1.13 20, 34
1.14 21, 39
1.15 3, 41
1.16 4, 30
1.17 5, 47
1.18 6, 43
1.19 7, 42
1.2 10, 26
1.20 8, 45
1.21 9, 31
1.3 11, 25
1.4 12, 40
1.5 13, 44
1.6 14, 24
1.7 15, 36
1.8 16, 32
1.9 17, 33
summary(pairmatch(Group1 ~ ps, data = gen.tox.op))Structure of matched sets:
1:1 0:1
21 5
Effective Sample Size: 21
(equivalent number of matched pairs).
#all.equal(names(pm), row.names(gen.tox))
## Housekeeping
ipm <- as.integer(pm)
gen.tox.op <- cbind(gen.tox.op, matches=pm, ipm)
data.pairmatch<-gen.tox.op[matched(pm),] # only select matched cases
Welders <- data.pairmatch %>%
filter(Group=="Welders") %>%
arrange(ipm) %>%
dplyr::select(Age, Race, Smoker, ps, ipm)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm")]
matched.controls <- data.pairmatch %>%
filter(Group=="Controls") %>%
arrange(ipm) %>%
dplyr::select(Age, Race, Smoker, ps, ipm)
kable(cbind(Welders,matched.controls), caption="Optimal pair match using the squared difference in the PS") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Pair | Age | Race | Smoker | ps | ipm | Age | Race | Smoker | ps | ipm | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 41 | 1 | 1 | 0.35 | 1 |
| 18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
| 19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 30 | 0 | 0 | 0.63 | 3 |
| 2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
| 20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 41 | 0 | 0 | 0.40 | 5 |
| 21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 36 | 0 | 1 | 0.64 | 6 |
| 3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 35 | 0 | 0 | 0.52 | 7 |
| 4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 38 | 0 | 0 | 0.46 | 8 |
| 5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
| 6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 39 | 0 | 1 | 0.57 | 10 |
| 7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 34 | 0 | 1 | 0.67 | 11 |
| 10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 50 | 0 | 0 | 0.23 | 12 |
| 8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 42 | 0 | 0 | 0.38 | 13 |
| 9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 34 | 0 | 0 | 0.54 | 14 |
| 11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 40 | 0 | 0 | 0.42 | 15 |
| 12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 44 | 0 | 0 | 0.34 | 16 |
| 13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 45 | 0 | 0 | 0.32 | 17 |
| 14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 44 | 0 | 1 | 0.47 | 18 |
| 15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 31 | 1 | 1 | 0.55 | 19 |
| 16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 42 | 0 | 0 | 0.38 | 20 |
| 17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 36 | 0 | 1 | 0.64 | 21 |
m.age.mps <- round(mean(matched.controls$Age),0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)
ps.mps <- round(mean(matched.controls$ps),2)Welders — Matched Controls
| Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) |
|---|---|---|---|---|---|---|---|---|
| 38 | 10 | 52 | 0.51 | — | 40 | 10 | 38 | 0.46 |
\(\Longrightarrow\) the marginal means the matching above are fairly well balanced, but the individual pairs are often not matched for race or smoking.
Mahalanobis distance within PS calipersThe caliper is half the standard deviation of the PS, or
0.172/2 = 0.086
z<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# Mahalanobis distances
dmat<-mahal(z,x)
# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distanced penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat<-addcaliper(dmat,z,prop,caliper=.5)
# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)
ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases
Welders <- data.pairmatch.cal %>%
filter(Group=="Welders") %>%
arrange(ipm.cal) %>%
dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]
matched.controls <- data.pairmatch.cal %>%
filter(Group=="Controls") %>%
arrange(ipm.cal) %>%
dplyr::select(Age, Race, Smoker, ps, ipm.cal)
kable(cbind(Welders,matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Pair | Age | Race | Smoker | ps | ipm.cal | Age | Race | Smoker | ps | ipm.cal | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 44 | 0 | 0 | 0.34 | 1 |
| 18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
| 19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 34 | 0 | 0 | 0.54 | 3 |
| 2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
| 20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 42 | 0 | 0 | 0.38 | 5 |
| 21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 31 | 1 | 1 | 0.55 | 6 |
| 3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 36 | 0 | 1 | 0.64 | 7 |
| 4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 41 | 1 | 1 | 0.35 | 8 |
| 5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
| 6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 35 | 0 | 0 | 0.52 | 10 |
| 7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 30 | 0 | 0 | 0.63 | 11 |
| 10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 48 | 1 | 0 | 0.14 | 12 |
| 8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 45 | 0 | 0 | 0.32 | 13 |
| 9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 36 | 0 | 1 | 0.64 | 14 |
| 11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 44 | 0 | 1 | 0.47 | 15 |
| 12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 41 | 0 | 0 | 0.40 | 16 |
| 13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 40 | 0 | 0 | 0.42 | 17 |
| 14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 42 | 0 | 0 | 0.38 | 18 |
| 15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 39 | 0 | 1 | 0.57 | 19 |
| 16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 38 | 0 | 0 | 0.46 | 20 |
| 17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 34 | 0 | 1 | 0.67 | 21 |
m.age.mps <- round(mean(matched.controls$Age),0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)
ps.mps <- round(mean(matched.controls$ps),2)Welders — Matched Controls
| Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) |
|---|---|---|---|---|---|---|---|---|
| 38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.45 |
rank-based Mahalanobis distance within PS calipersmahal by
smahalz<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# rank-based Mahalanobis distances
dmat<-smahal(z,x)
# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distanced penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat<-addcaliper(dmat,z,prop,caliper=.5)
# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)
ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases
Welders <- data.pairmatch.cal %>%
filter(Group=="Welders") %>%
arrange(ipm.cal) %>%
dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]
matched.controls <- data.pairmatch.cal %>%
filter(Group=="Controls") %>%
arrange(ipm.cal) %>%
dplyr::select(Age, Race, Smoker, ps, ipm.cal)
kable(cbind(Welders,matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers") %>%
kable_styling(bootstrap_options = "striped", full_width = F)| Pair | Age | Race | Smoker | ps | ipm.cal | Age | Race | Smoker | ps | ipm.cal | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 44 | 0 | 0 | 0.34 | 1 |
| 18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
| 19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 31 | 1 | 1 | 0.55 | 3 |
| 2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
| 20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 42 | 0 | 0 | 0.38 | 5 |
| 21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 34 | 0 | 0 | 0.54 | 6 |
| 3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 35 | 0 | 0 | 0.52 | 7 |
| 4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 41 | 1 | 1 | 0.35 | 8 |
| 5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
| 6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 36 | 0 | 1 | 0.64 | 10 |
| 7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 30 | 0 | 0 | 0.63 | 11 |
| 10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 48 | 1 | 0 | 0.14 | 12 |
| 8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 45 | 0 | 0 | 0.32 | 13 |
| 9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 36 | 0 | 1 | 0.64 | 14 |
| 11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 44 | 0 | 1 | 0.47 | 15 |
| 12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 41 | 0 | 0 | 0.40 | 16 |
| 13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 40 | 0 | 0 | 0.42 | 17 |
| 14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 42 | 0 | 0 | 0.38 | 18 |
| 15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 39 | 0 | 1 | 0.57 | 19 |
| 16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 38 | 0 | 0 | 0.46 | 20 |
| 17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 34 | 0 | 1 | 0.67 | 21 |
m.age.mps <- round(mean(matched.controls$Age),0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)
ps.mps <- round(mean(matched.controls$ps),2)Welders — Matched Controls
| Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA | Smoker | \(\hat{e}(\mathbf{x})\) |
|---|---|---|---|---|---|---|---|---|
| 38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.45 |
In matching with multiple controls, each treated subject is matched to at least one, and possibly more than one, control:
a fixed ratio is to match each treated
subject to the same number of controls; for instance, pair matching in a
ratio of 1-to-1, or matching in a ratio of 1-to-2.a variable ratio: to allow the number of
controls to vary from one treated subject to another. In the welder data
above, allows the possibility of matching with multiple controls in a
variable ratio, and in particular to use all of the controls.The decision to match in fixed or
variable ratio that affects both:
Matching with a variable number of controls is slightly more complex:
variable controls minimizes the total distance between
treated subjects and controls inside matched sets.The principal advantage of matching in a fixed ratio:
direct adjustment to give unequal
weights to observations. This advantage will weigh heavily when
potential controls are abundant and the biases that must be removed are
not large (!?!).Several advantages to matching with a variable ratio Paul R. Rosenbaum (1989):
Finding an optimal match with variable controls is equivalent to solving a particular minimum-cost flow problem in a network (Paul R. Rosenbaum 1989).
In full matching, besides matching with a variable number of controls, i.e. a treated subject could be matched to one or more controls, the reverse situation is permitted: one control may be matched to several treated subjects (P. R. Rosenbaum 1991).
As in matching with variable controls, summary statistics for the control group in full matching must be directly adjusted.
Full matching can be vastly better than pair matching or matching with a variable number of controls. The matched sets would be quite homogeneous, the quite unequal set sizes can lead to some inefficiency.
In one specific sense, an optimal full matching is an optimal design for an observational study (P. R. Rosenbaum 1991). Specifically, define a stratification to be a partitioning of the subjects into groups or strata based on the covariates with the one requirement that each stratum must contain at least one treated subject and at least one control. The quality of a stratification might reasonably be judged by a weighted average of all the within strata distances between treated subjects and controls.
No matter which of these weightings is used, no matter what distance is used, there is always a full matching that minimizes this weighted average distance (P. R. Rosenbaum 1991).
m<-fullmatch(dmat, data=gen.tox, min.controls = 1/7,max.controls = 7)
length(m)[1] 47
sum(matched(m))[1] 47
# Housekeeping
im<-as.integer(m)
gen.tox.fm<-cbind(gen.tox,im)
g.t.m<-gen.tox[matched(m),]
Welders <- gen.tox.fm %>%
filter(Group=="Welders") %>%
arrange(im) %>%
dplyr::select(Age, Race, Smoker, ps, im)
#Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Age","Race","Smoker","ps","im")]
fmatched.controls <- gen.tox.fm %>%
filter(Group=="Controls") %>%
arrange(im) %>%
dplyr::select(Age, Race, Smoker, ps, im)
kable(list(Welders,fmatched.controls), caption="Optimal full match using the Mahalanobis distance within PS calipers") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
|
|
(be continued)
Matching on one variable, the propensity score, tends to
produce treated and control groups that, in aggregate, are balanced with
respect to observed covariates;
\(\rightarrow\) Cons: individual pairs that are close on the propensity score may differ widely on specific covariates.
\(\rightarrow\) Solution: to form
closer pairs, a distance is used that penalizes
large differences on the propensity score, and then finds individual
pairs that are as close as possible.
Pairs or matched sets are constructed using an optimization
algorithm. Matching with variable controls and
full matching combine elements of matching with elements of
direct adjustment.
Full matching can often produce closer matches than
pair matching.
Chapter 9, Basic Tools of Multivariate Matching from Rosenbaum, Paul R. Design of Observational Studies. 2nd ed. 2020., Springer International Publishing, 2020, https://doi.org/10.1007/978-3-030-46405-9.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2022, April 19). HaiBiostat: Basic Tools of Multivariate Matching. Retrieved from https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/
BibTeX citation
@misc{nguyen2022basic,
author = {Nguyen, Hai},
title = {HaiBiostat: Basic Tools of Multivariate Matching},
url = {https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/},
year = {2022}
}